Abstract

Inspired by recent work in machine translation 
and object detection, we introduce an attention 
based model that automatically learns to describe 
the content of images. We describe how we 
can train this model in a deterministic manner 
using standard backpropagation techniques and 
stochastically by maximizing a variational lower 
bound. We also show through visualization how 
the model is able to automatically learn to fix its 
gaze on salient objects while generating the corresponding
words in the output sequence. We validate the use of
attention with state-of-the-art performance on three
benchmark datasets: Flickr8k, Flickr30k and MS COCO.


1. Introduction 

Automatically generating captions for an image is a task
very close to the heart of scene understanding, one of the
primary goals of computer vision. Not only must caption
generation models be powerful enough to solve the computer
vision challenges of determining which objects are in
an image, but they must also be capable of capturing and
expressing their relationships in natural language. For
this reason, caption generation has long been viewed as
a difficult problem. It is a very important challenge for
machine learning algorithms, as it amounts to mimicking
the remarkable human ability to compress huge amounts of
salient visual information into descriptive language.

Despite the challenging nature of this task, there has been
a recent surge of research interest in attacking the image
caption generation problem. Aided by advances in training
neural networks (Krizhevsky et al., 2012) and large
classification datasets (Russakovsky et al., 2014), recent work
has significantly improved the quality of caption generation
using a combination of convolutional neural networks
(convnets) to obtain vectorial representations of images and
recurrent neural networks to decode those representations
into natural language sentences (see Sec. 2).

Figure 1. Our model learns a words/image alignment. The
visualized attentional maps (3) are explained in Sections 3.1 & 5.4.

One of the most curious facets of the human visual system
is the presence of attention (Rensink, 2000; Corbetta &
Shulman, 2002). Rather than compress an entire image into
a static representation, attention allows for salient features
to dynamically come to the forefront as needed. This is
especially important when there is a lot of clutter in an
image. Using representations (such as those from the top layer
of a convnet) that distill information in an image down to the
most salient objects is one effective solution that has been
widely adopted in previous work. Unfortunately, this has
one potential drawback of losing information which could
be useful for richer, more descriptive captions. Using a
lower-level representation can help preserve this information.
However, working with these features necessitates a powerful
mechanism to steer the model to information important
to the task at hand.

In this paper, we describe approaches to caption generation
that attempt to incorporate a form of attention with
two variants: a “hard” attention mechanism and a “soft”
attention mechanism. We also show how one advantage of
including attention is the ability to visualize what the model
“sees”. Encouraged by recent advances in caption generation
and inspired by recent success in employing attention
in machine translation (Bahdanau et al., 2014) and object
recognition (Ba et al., 2014; Mnih et al., 2014), we investigate
models that can attend to salient parts of an image while
generating its caption.

Neural Image Caption Generation with Visual Attention

Figure 2. Attention over time. As the model generates each word, its attention changes to reflect the relevant parts of the image, “soft”
(top row) vs “hard” (bottom row) attention. (Note that both models generated the same captions in this example.)

Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).
Example captions shown in the figure:
A woman is throwing a frisbee in a park.
A dog is standing on a hardwood floor.
A stop sign is on a road with a mountain in the background.

The contributions of this paper are the following: 

• We introduce two attention-based image caption generators
under a common framework (Sec. 3.1): 1) a
“soft” deterministic attention mechanism trainable by
standard back-propagation methods and 2) a “hard”
stochastic attention mechanism trainable by maximizing
an approximate variational lower bound or equivalently
by REINFORCE (Williams, 1992).

• We show how we can gain insight and interpret the
results of this framework by visualizing “where” and
“what” the attention focused on (see Sec. 5.4).

• Finally, we quantitatively validate the usefulness of
attention in caption generation with state-of-the-art
performance (Sec. 5.3) on three benchmark datasets:
Flickr8k (Hodosh et al., 2013), Flickr30k (Young
et al., 2014) and the MS COCO dataset (Lin et al.,
2014).


2. Related Work 

In this section we provide relevant background on previous
work on image caption generation and attention. Recently,
several methods have been proposed for generating image
descriptions. Many of these methods are based on recurrent
neural networks and inspired by the successful use of
sequence-to-sequence training with neural networks for
machine translation (Cho et al., 2014; Bahdanau et al., 2014;
Sutskever et al., 2014). One major reason image caption
generation is well suited to the encoder-decoder framework
(Cho et al., 2014) of machine translation is that it is
analogous to “translating” an image to a sentence.

The first approach to use neural networks for caption generation
was Kiros et al. (2014a), who proposed a multimodal
log-bilinear model that was biased by features from the
image. This work was later followed by Kiros et al. (2014b)
whose method was designed to explicitly allow a natural
way of doing both ranking and generation. Mao et al.
(2014) took a similar approach to generation but replaced a
feed-forward neural language model with a recurrent one.
Both Vinyals et al. (2014) and Donahue et al. (2014) use
LSTM RNNs for their models. Unlike Kiros et al. (2014a)
and Mao et al. (2014) whose models see the image at each
time step of the output word sequence, Vinyals et al. (2014)
only show the image to the RNN at the beginning. Along
with images, Donahue et al. (2014) also apply LSTMs to
videos, allowing their model to generate video descriptions.

All of these works represent images as a single feature vector
from the top layer of a pre-trained convolutional network.
Karpathy & Li (2014) instead proposed to learn a
joint embedding space for ranking and generation whose
model learns to score sentence and image similarity as a
function of R-CNN object detections with outputs of a
bidirectional RNN. Fang et al. (2014) proposed a three-step
pipeline for generation by incorporating object detections.
Their model first learns detectors for several visual concepts
based on a multi-instance learning framework. A language
model trained on captions was then applied to the detector
outputs, followed by rescoring from a joint image-text
embedding space. Unlike these models, our proposed attention
framework does not explicitly use object detectors but
instead learns latent alignments from scratch. This allows
our model to go beyond “objectness” and learn to attend to
abstract concepts.

Prior to the use of neural networks for generating captions,
two main approaches were dominant. The first involved
generating caption templates which were filled in based
on the results of object detections and attribute discovery
(Kulkarni et al., 2013; Li et al., 2011; Yang et al., 2011;
Mitchell et al., 2012; Elliott & Keller, 2013). The second
approach was based on first retrieving similar captioned
images from a large database, then modifying these retrieved
captions to fit the query (Kuznetsova et al., 2012; 2014).
These approaches typically involved an intermediate
“generalization” step to remove the specifics of a caption that
are only relevant to the retrieved image, such as the name
of a city. Both of these approaches have since fallen out of
favour to the now dominant neural network methods.

There has been a long line of previous work incorporating
attention into neural networks for vision related
tasks. Some that share the same spirit as our work include
Larochelle & Hinton (2010); Denil et al. (2012); Tang et al.
(2014). In particular however, our work directly extends
the work of Bahdanau et al. (2014); Mnih et al. (2014); Ba
et al. (2014).

3. Image Caption Generation with Attention Mechanism

3.1. Model Details 

In this section, we describe the two variants of our
attention-based model by first describing their common
framework. The main difference is the definition of the
$\phi$ function which we describe in detail in Section 4. We
denote vectors with bolded font and matrices with capital
letters. In our description below, we suppress bias terms for
readability.


Figure 4. An LSTM cell; lines with bolded squares imply projections
with a learnt weight vector. Each cell learns how to weigh
its input components (input gate), while learning how to modulate
that contribution to the memory (input modulator). It also learns
weights which erase the memory cell (forget gate), and weights
which control how this memory should be emitted (output gate).

3.1.1. Encoder: Convolutional Features

Our model takes a single raw image and generates a caption
$\mathbf{y}$ encoded as a sequence of 1-of-$K$ encoded words.

$$\mathbf{y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}, \quad \mathbf{y}_i \in \mathbb{R}^K$$

where $K$ is the size of the vocabulary and $C$ is the length
of the caption.

We use a convolutional neural network in order to extract a
set of feature vectors which we refer to as annotation vectors.
The extractor produces $L$ vectors, each of which is
a $D$-dimensional representation corresponding to a part of
the image.

$$\mathbf{a} = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}, \quad \mathbf{a}_i \in \mathbb{R}^D$$

In order to obtain a correspondence between the feature 
vectors and portions of the 2-D image, we extract features 
from a lower convolutional layer unlike previous work 
which instead used a fully connected layer. This allows the 
decoder to selectively focus on certain parts of an image by 
selecting a subset of all the feature vectors. 

3.1.2. Decoder: Long Short-Term Memory Network

We use a long short-term memory (LSTM) network
(Hochreiter & Schmidhuber, 1997) that produces a
caption by generating one word at every time step conditioned
on a context vector, the previous hidden state and the
previously generated words. Our implementation of LSTM
closely follows the one used in Zaremba et al. (2014) (see
Fig. 4). Using $T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t$ to denote a simple affine
transformation with parameters that are learned,

$$\begin{pmatrix} \mathbf{i}_t \\ \mathbf{f}_t \\ \mathbf{o}_t \\ \mathbf{g}_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} \mathbf{E}\mathbf{y}_{t-1} \\ \mathbf{h}_{t-1} \\ \hat{\mathbf{z}}_t \end{pmatrix} \tag{1}$$

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \tag{2}$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \tag{3}$$

Here, $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{c}_t$, $\mathbf{o}_t$, $\mathbf{h}_t$ are the input, forget, memory, output
and hidden state of the LSTM, respectively, and $\mathbf{g}_t$ is the
input modulator of Fig. 4. The vector
$\hat{\mathbf{z}} \in \mathbb{R}^D$ is the context vector, capturing the visual information
associated with a particular input location, as explained
below. $\mathbf{E} \in \mathbb{R}^{m \times K}$ is an embedding matrix. Let
$m$ and $n$ denote the embedding and LSTM dimensionality
respectively and $\sigma$ and $\odot$ be the logistic sigmoid activation
and element-wise multiplication respectively.
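
As an illustration only, one decoder step in the spirit of Eqs. (1)-(3) can be sketched in plain Python. The function names (`affine`, `lstm_step`), the row-stacked layout of the weight matrix `W`, and the use of lists for vectors are assumptions made for this sketch, not details of the authors' implementation; bias terms are suppressed as in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x):
    """Multiply matrix W (a list of rows) by vector x; no bias term."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def lstm_step(W, emb_prev, h_prev, c_prev, z):
    """One LSTM step driven by [E y_{t-1}; h_{t-1}; z_t].

    Assumed layout: W has 4n rows and m + n + D columns, with rows stacked
    as [input gate, forget gate, output gate, input modulator].
    """
    n = len(h_prev)
    x = emb_prev + h_prev + z                      # concatenate the three inputs
    pre = affine(W, x)                             # pre-activations of Eq. (1)
    i = [sigmoid(v) for v in pre[0:n]]             # input gate
    f = [sigmoid(v) for v in pre[n:2 * n]]         # forget gate
    o = [sigmoid(v) for v in pre[2 * n:3 * n]]     # output gate
    g = [math.tanh(v) for v in pre[3 * n:4 * n]]   # input modulator
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]  # Eq. (2)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                    # Eq. (3)
    return h, c
```

With an all-zero `W`, every gate evaluates to 0.5 and the modulator to 0, so the new memory is simply half of the previous memory, which gives a quick sanity check of the update.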


In simple terms, the context vector $\hat{\mathbf{z}}_t$ (equations (1)-(3)) is
a dynamic representation of the relevant part of the image
input at time $t$. We define a mechanism $\phi$ that computes $\hat{\mathbf{z}}_t$
from the annotation vectors $\mathbf{a}_i$, $i = 1, \ldots, L$ corresponding
to the features extracted at different image locations. For
each location $i$, the mechanism generates a positive weight
$\alpha_i$ which can be interpreted either as the probability that
location $i$ is the right place to focus for producing the next
word (the “hard” but stochastic attention mechanism), or as
the relative importance to give to location $i$ in blending the
$\mathbf{a}_i$'s together. The weight $\alpha_i$ of each annotation vector $\mathbf{a}_i$
is computed by an attention model $f_{\text{att}}$ for which we use
a multilayer perceptron conditioned on the previous hidden
state $\mathbf{h}_{t-1}$. The soft version of this attention mechanism
was introduced by Bahdanau et al. (2014). For emphasis,
we note that the hidden state varies as the output RNN
advances in its output sequence: “where” the network looks
next depends on the sequence of words that has already
been generated.


$$e_{ti} = f_{\text{att}}(\mathbf{a}_i, \mathbf{h}_{t-1}) \tag{4}$$

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})} \tag{5}$$
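
A minimal sketch of Eqs. (4) and (5) follows. The text only states that $f_{\text{att}}$ is a multilayer perceptron; the specific one-hidden-layer form used here (`tanh` of two linear projections followed by a dot product with `v`) and the parameter names `W_a`, `W_h`, `v` are assumptions chosen for illustration.

```python
import math

def attention_scores(annotations, h_prev, W_a, W_h, v):
    """Eq. (4) under an assumed MLP: e_i = v . tanh(W_a a_i + W_h h_prev)."""
    proj_h = [sum(w * x for w, x in zip(row, h_prev)) for row in W_h]
    scores = []
    for a in annotations:
        proj_a = [sum(w * x for w, x in zip(row, a)) for row in W_a]
        hidden = [math.tanh(pa + ph) for pa, ph in zip(proj_a, proj_h)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return scores

def softmax(scores):
    """Eq. (5): turn scores into positive weights that sum to one."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the scores depend on `h_prev`, calling these functions with a different hidden state yields different weights over the same annotation vectors, which is the sense in which “where” the model looks depends on the words generated so far.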


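Once the weights (which sum to one) are computed, the
context vector $\hat{\mathbf{z}}_t$ is computed by

$$\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\}), \tag{6}$$

where $\phi$ is a function that returns a single vector given the
set of annotation vectors and their corresponding weights.
The details of the $\phi$ function are discussed in Sec. 4.

The initial memory state and hidden state of the LSTM
are predicted by an average of the annotation vectors fed
through two separate MLPs ($\text{init,c}$ and $\text{init,h}$):

$$\mathbf{c}_0 = f_{\text{init,c}}\Big(\frac{1}{L}\sum_{i=1}^{L} \mathbf{a}_i\Big), \qquad \mathbf{h}_0 = f_{\text{init,h}}\Big(\frac{1}{L}\sum_{i=1}^{L} \mathbf{a}_i\Big)$$

In this work, we use a deep output layer (Pascanu et al.,
2014) to compute the output word probability given the
LSTM state, the context vector and the previous word:

$$p(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp\big(\mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}}_t)\big) \tag{7}$$

where $\mathbf{L}_o \in \mathbb{R}^{K \times m}$, $\mathbf{L}_h \in \mathbb{R}^{m \times n}$, $\mathbf{L}_z \in \mathbb{R}^{m \times D}$, and $\mathbf{E}$
are learned parameters initialized randomly.


4. Learning Stochastic “Hard” vs Deterministic “Soft” Attention

In this section we discuss two alternative mechanisms for
the attention model $f_{\text{att}}$: stochastic attention and deterministic
attention.


4.1. Stochastic “Hard” Attention

We represent the location variable $s_t$ as where the model
decides to focus attention when generating the $t$-th word.
$s_{t,i}$ is an indicator one-hot variable which is set to 1 if the
$i$-th location (out of $L$) is the one used to extract visual
features. By treating the attention locations as intermediate
latent variables, we can assign a multinoulli distribution
parametrized by $\{\alpha_i\}$, and view $\hat{\mathbf{z}}_t$ as a random variable:

$$p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i} \tag{8}$$

$$\hat{\mathbf{z}}_t = \sum_i s_{t,i} \mathbf{a}_i. \tag{9}$$

We define a new objective function $L_s$ that is a variational
lower bound on the marginal log-likelihood $\log p(\mathbf{y} \mid \mathbf{a})$ of
observing the sequence of words $\mathbf{y}$ given image features $\mathbf{a}$.
The learning algorithm for the parameters $W$ of the models
can be derived by directly optimizing $L_s$:

$$L_s = \sum_s p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \le \log \sum_s p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a}) \tag{10}$$

$$\frac{\partial L_s}{\partial W} = \sum_s p(s \mid \mathbf{a}) \left[ \frac{\partial \log p(\mathbf{y} \mid s, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid s, \mathbf{a}) \frac{\partial \log p(s \mid \mathbf{a})}{\partial W} \right] \tag{11}$$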
Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw.
Example captions shown in the figure:
A large white bird standing in a forest.
A woman holding a clock in her hand.
A man wearing a hat and a hat on a skateboard.
A person is standing on a beach with a surfboard.
A woman is sitting at a table with a large pizza.
A man is talking on his cell phone while another man watches.


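Equation 11 suggests a Monte Carlo based sampling approximation
of the gradient with respect to the model parameters.
This can be done by sampling the location $s_t$
from a multinoulli distribution defined by Equation 8.

$$\tilde{s}_t \sim \text{Multinoulli}_L(\{\alpha_i\})$$

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) \frac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W} \right] \tag{12}$$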
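
The sampling step behind this estimator can be sketched as follows: draw a one-hot attention location from the multinoulli distribution with probabilities $\alpha$ (Eq. 8) and read off the selected annotation vector as the context (Eq. 9). The helper names and the explicit `rng` argument are hypothetical choices for this sketch, which covers only the sampling and not the gradient computation.

```python
import random

def sample_location(alpha, rng):
    """Draw a one-hot location with probabilities alpha (Eq. 8)."""
    idx = rng.choices(range(len(alpha)), weights=alpha, k=1)[0]
    return [1 if i == idx else 0 for i in range(len(alpha))]

def hard_context(s, annotations):
    """Eq. (9): z_t = sum_i s_{t,i} a_i, i.e. the one selected annotation vector."""
    dim = len(annotations[0])
    return [sum(s_i * a[d] for s_i, a in zip(s, annotations)) for d in range(dim)]
```

For example, `hard_context(sample_location(alpha, random.Random(0)), annotations)` returns exactly one of the rows of `annotations`, chosen with probability given by `alpha`.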


A moving average baseline is used to reduce the variance
in the Monte Carlo estimator of the gradient, following
Weaver & Tao (2001). Similar, but more complicated
variance reduction techniques have previously been used
by Mnih et al. (2014) and Ba et al. (2014). Upon seeing the
$k$-th mini-batch, the moving average baseline is estimated
as an accumulated sum of the previous log likelihoods with
exponential decay:

$$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} \mid \tilde{s}_k, \mathbf{a})$$
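
The baseline update is a plain exponential moving average, sketched below. The function names and the `decay` parameter are illustrative; only the 0.9/0.1 weighting comes from the text.

```python
def update_baseline(b_prev, log_likelihood, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a)."""
    return decay * b_prev + (1.0 - decay) * log_likelihood

def centered_reward(log_likelihood, baseline):
    """Baseline-subtracted log likelihood used to weight the score-function term."""
    return log_likelihood - baseline
```

Subtracting the baseline leaves the expected gradient unchanged while shrinking the magnitude of the term that multiplies the gradient of $\log p(\tilde{s} \mid \mathbf{a})$, which is why it reduces variance.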

To further reduce the estimator variance, an entropy term
on the multinoulli distribution $H[s]$ is added. Also, with
probability 0.5 for a given image, we set the sampled attention
location $\tilde{s}$ to its expected value $\alpha$. Both techniques
improve the robustness of the stochastic attention learning
algorithm. The final learning rule for the model is then the
following:

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \lambda_r \big(\log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) - b\big) \frac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W} \right]$$

where $\lambda_r$ and $\lambda_e$ are two hyper-parameters set by
cross-validation. As pointed out and used in Ba et al. (2014)
and Mnih et al. (2014), this formulation is equivalent to
the REINFORCE learning rule (Williams, 1992), where the
reward for the attention choosing a sequence of actions is
a real value proportional to the log likelihood of the target
sentence under the sampled attention trajectory.

In making a hard choice at every point, $\phi(\{\mathbf{a}_i\}, \{\alpha_i\})$
from Equation 6 is a function that returns a sampled $\mathbf{a}_i$ at
every point in time based upon a multinoulli distribution
parameterized by $\alpha$.


4.2. Deterministic “Soft” Attention 

Learning stochastic attention requires sampling the attention
location $s_t$ each time; instead we can take the expectation
of the context vector $\hat{\mathbf{z}}_t$ directly,

$$\mathbb{E}_{p(s_t \mid a)}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i} \mathbf{a}_i \tag{13}$$

and formulate a deterministic attention model by computing
a soft attention weighted annotation vector
$\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_{i}^{L} \alpha_i \mathbf{a}_i$ as introduced by Bahdanau
et al. (2014). This corresponds to feeding a soft
$\alpha$-weighted context into the system. The whole model is
smooth and differentiable under the deterministic attention,
so learning end-to-end is trivial by using standard
back-propagation.
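
The soft context vector of Eq. (13) is just an attention-weighted average of the annotation vectors. A minimal sketch (the function name `soft_context` is an illustrative choice):

```python
def soft_context(alpha, annotations):
    """Eq. (13): expected context, sum_i alpha_i * a_i, computed per dimension."""
    dim = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(alpha, annotations)) for d in range(dim)]
```

When `alpha` is one-hot, this reduces to selecting a single annotation vector, so the hard context of Eq. (9) is the special case in which all the weight falls on one location.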

Learning the deterministic attention can also be understood
as approximately optimizing the marginal likelihood
in Equation 10 under the attention location random variable
$s_t$ from Sec. 4.1. The hidden activation of LSTM
$\mathbf{h}_t$ is a linear projection of the stochastic context vector
$\hat{\mathbf{z}}_t$ followed by tanh non-linearity. To the first order Taylor
approximation, the expected value $\mathbb{E}_{p(s_t \mid a)}[\mathbf{h}_t]$ is equal
to computing $\mathbf{h}_t$ using a single forward prop with the expected
context vector $\mathbb{E}_{p(s_t \mid a)}[\hat{\mathbf{z}}_t]$. Considering Eq. 7, let
$\mathbf{n}_t = \mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}}_t)$, and let $\mathbf{n}_{t,i}$ denote $\mathbf{n}_t$ computed
by setting the random variable $\hat{\mathbf{z}}$ value to $\mathbf{a}_i$. We define the
normalized weighted geometric mean for the softmax $k$-th
word prediction:

$$\text{NWGM}[p(y_t = k \mid \mathbf{a})] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,k}])}{\sum_j \exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,j}])}$$

The equation above shows the normalized weighted geometric
mean of the caption prediction can be approximated
well by using the expected context vector, where
$\mathbb{E}[\mathbf{n}_t] = \mathbf{L}_o(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbb{E}[\mathbf{h}_t] + \mathbf{L}_z \mathbb{E}[\hat{\mathbf{z}}_t])$. It shows that
the NWGM of a softmax unit is obtained by applying softmax
to the expectations of the underlying linear projections.
Also, from the results in (Baldi & Sadowski, 2014),
$\text{NWGM}[p(y_t = k \mid \mathbf{a})] \approx \mathbb{E}[p(y_t = k \mid \mathbf{a})]$ under
softmax activation. That means the expectation of the outputs
over all possible attention locations induced by random
variable $s_t$ is computed by simple feedforward propagation
with expected context vector $\mathbb{E}[\hat{\mathbf{z}}_t]$. In other words,
the deterministic attention model is an approximation to the
marginal likelihood over the attention locations.
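
The identity used in this argument, that the NWGM of softmax outputs equals the softmax of the expected logits, can be checked numerically. In the sketch below, `logits_by_location[i][k]` plays the role of $n_{t,k,i}$ and `probs[i]` the role of $p(s_{t,i}=1 \mid a)$; the function names and input layout are assumptions made for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nwgm(logits_by_location, probs):
    """Normalized weighted geometric mean of exp(logit) over attention locations."""
    num_words = len(logits_by_location[0])
    geo = [
        math.prod(math.exp(n_i[k]) ** p for n_i, p in zip(logits_by_location, probs))
        for k in range(num_words)
    ]
    total = sum(geo)
    return [g / total for g in geo]

def softmax_of_expected_logits(logits_by_location, probs):
    """Softmax applied to the probability-weighted average of the logits."""
    num_words = len(logits_by_location[0])
    expected = [
        sum(p * n_i[k] for n_i, p in zip(logits_by_location, probs))
        for k in range(num_words)
    ]
    return softmax(expected)
```

The two functions agree up to floating-point error for any valid `probs`; what remains an approximation in the text is the further step from the NWGM to the true expectation of the softmax outputs.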


4.2.1. Doubly Stochastic Attention

By construction, $\sum_i \alpha_{ti} = 1$ as they are the output of a
softmax. In training the deterministic version of our model
we introduce a form of doubly stochastic regularization,
where we also encourage $\sum_t \alpha_{ti} \approx 1$. This can be
interpreted as encouraging the model to pay equal attention
to every part of the image over the course of generation. In
our experiments, we observed that this penalty was important
quantitatively to improving overall BLEU score and
that qualitatively this leads to more rich and descriptive
captions. In addition, the soft attention model predicts a
gating scalar $\beta$ from previous hidden state $\mathbf{h}_{t-1}$ at each
time step $t$, such that $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_{i}^{L} \alpha_i \mathbf{a}_i$, where
$\beta_t = \sigma(f_\beta(\mathbf{h}_{t-1}))$. We notice our attention weights put
more emphasis on the objects in the images by including
the scalar $\beta$.

Concretely, the model is trained end-to-end by minimizing
the following penalized negative log-likelihood:

$$L_d = -\log(P(\mathbf{y} \mid \mathbf{x})) + \lambda \sum_{i}^{L} \Big(1 - \sum_{t}^{C} \alpha_{ti}\Big)^2 \tag{14}$$


4.3. Training Procedure

Both variants of our attention model were trained with
stochastic gradient descent using adaptive learning rate
algorithms. For the Flickr8k dataset, we found that RMSProp
(Tieleman & Hinton, 2012) worked best, while for the
Flickr30k/MS COCO datasets we used the recently proposed
Adam algorithm (Kingma & Ba, 2014).

To create the annotations used by our decoder, we used
the Oxford VGGnet (Simonyan & Zisserman, 2014) pre-trained
on ImageNet without finetuning. In principle however,
any encoding function could be used. In addition,
with enough data, we could also train the encoder from
scratch (or fine-tune) with the rest of the model. In our
experiments we use the 14x14x512 feature map of the fourth
convolutional layer before max pooling. This means our
decoder operates on the flattened 196 x 512 (i.e. L x D)
encoding.

As our implementation requires time proportional to the
length of the longest sentence per update, we found training
on a random group of captions to be computationally
wasteful. To mitigate this problem, in preprocessing we
build a dictionary mapping the length of a sentence to the
corresponding subset of captions. Then, during training we
randomly sample a length and retrieve a mini-batch of size
64 of that length. We found that this greatly improved
convergence speed with no noticeable diminishment in
performance. On our largest dataset (MS COCO), our soft
attention model took less than 3 days to train on an NVIDIA
Titan Black GPU.

In addition to dropout (Srivastava et al., 2014), the only
other regularization strategy we used was early stopping
on BLEU score. We observed a breakdown in correlation
between the validation set log-likelihood and BLEU in
the later stages of training during our experiments. Since
BLEU is the most commonly reported metric, we used
BLEU on our validation set for model selection.

In our experiments with soft attention, we also used
Whetlab¹ (Snoek et al., 2012; 2014) in our Flickr8k
experiments. Some of the intuitions we gained from hyperparameter
regions it explored were especially important in our
Flickr30k and COCO experiments.

We make our code for these models, based in Theano
(Bergstra et al., 2010), publicly available upon publication
to encourage future research in this area.

¹ https://www.whetlab.com/

Table 1. BLEU-1,2,3,4/METEOR metrics compared to other methods. † indicates a different split, (-) indicates an unknown metric, ∘
indicates the authors kindly provided missing metrics by personal communication, Σ indicates an ensemble, a indicates using AlexNet.

Dataset    Model                                        BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR
Flickr8k   Google NIC (Vinyals et al., 2014) †Σ         63      41      27      -       -
           Log Bilinear (Kiros et al., 2014a) ∘         65.6    42.4    27.7    17.7    17.31
           Soft-Attention                               67      44.8    29.9    19.5    18.93
           Hard-Attention                               67      45.7    31.4    21.3    20.30
Flickr30k  Google NIC †∘Σ                               66.3    42.3    27.7    18.3    -
           Log Bilinear                                 60.0    38      25.4    17.1    16.88
           Soft-Attention                               66.7    43.4    28.8    19.1    18.49
           Hard-Attention                               66.9    43.9    29.6    19.9    18.46
COCO       CMU/MS Research (Chen & Zitnick, 2014) a     -       -       -       -       20.41
           MS Research (Fang et al., 2014) †a           -       -       -       -       20.71
           BRNN (Karpathy & Li, 2014) ∘                 64.2    45.1    30.4    20.3    -
           Google NIC †∘Σ                               66.6    46.1    32.9    24.6    -
           Log Bilinear ∘                               70.8    48.9    34.4    24.3    20.03
           Soft-Attention                               70.7    49.2    34.4    24.3    23.90
           Hard-Attention                               71.8    50.4    35.7    25.0    23.04

5. Experiments 

We describe our experimental methodology and quantitative
results which validate the effectiveness of our model
for caption generation.

5.1. Data 

We report results on the popular Flickr8k and Flickr30k
datasets, which have 8,000 and 30,000 images respectively,
as well as the more challenging Microsoft COCO dataset,
which has 82,783 images. The Flickr8k and Flickr30k datasets
both come with 5 reference sentences per image, but for
the MS COCO dataset, some of the images have references
in excess of 5, which for consistency across our datasets
we discard. We applied only basic tokenization to MS
COCO so that it is consistent with the tokenization present
in Flickr8k and Flickr30k. For all our experiments, we used
a fixed vocabulary size of 10,000.

Results for our attention-based architecture are reported in
Table 1. We report results with the frequently used
BLEU metric² which is the standard in the caption generation
literature. We report BLEU from 1 to 4 without
a brevity penalty. There has been, however, criticism
of BLEU, so in addition we report another common metric
METEOR (Denkowski & Lavie, 2014), and compare
whenever possible.

² We verified that our BLEU evaluation code matches the
authors of Vinyals et al. (2014), Karpathy & Li (2014) and Kiros
et al. (2014b). For fairness, we only compare against results for
which we have verified that our BLEU evaluation code is the
same. With the upcoming release of the COCO evaluation server,
we will include comparison results with all other recent image
captioning models.
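
To make “BLEU from 1 to 4 without a brevity penalty” concrete, the sketch below computes clipped n-gram precisions and their geometric mean for a single candidate caption. It is a simplified sentence-level illustration, not the evaluation code used for the reported numbers: standard BLEU aggregates counts over a whole corpus, and the function names here are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in the reference where it is most frequent."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu_no_bp(candidate, references, max_n):
    """Geometric mean of the 1..max_n precisions, with no brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Without the brevity penalty, a short candidate whose n-grams all appear in a reference is not penalized for its length, which is one reason a second metric such as METEOR is a useful complement.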

5.2. Evaluation Procedures 

A few challenges exist for comparison, which we explain
here. The first is a difference in choice of convolutional
feature extractor. For identical decoder architectures, using
more recent architectures such as GoogLeNet or Oxford
VGG (Szegedy et al., 2014; Simonyan & Zisserman,
2014) can give a boost in performance over using
AlexNet (Krizhevsky et al., 2012). In our evaluation, we
compare directly only with results which use the comparable
GoogLeNet/Oxford VGG features, but for METEOR
comparison we note some results that use AlexNet.

The second challenge is a single model versus ensemble
comparison. While other methods have reported performance
boosts by using ensembling, in our results we report
a single model performance.

Finally, there is a challenge due to differences between
dataset splits. In our reported results, we use the pre-defined
splits of Flickr8k. However, one challenge for the
Flickr30k and COCO datasets is the lack of standardized
splits. As a result, we report with the publicly available
splits³ used in previous work (Karpathy & Li, 2014). In our
experience, differences in splits do not make a substantial
difference in overall performance, but we note the differences
where they exist.

³ http://cs.stanford.edu/people/karpathy/deepimagesent/

5.3. Quantitative Analysis 

In Table 1, we provide a summary of the experiments
validating the quantitative effectiveness of attention.
We obtain state-of-the-art performance on Flickr8k,
Flickr30k and MS COCO. In addition, we note that in our
experiments we are able to significantly improve the
state-of-the-art METEOR performance on MS COCO, which we
speculate is connected to some of the regularization techniques
we used (Sec. 4.2.1) and our lower-level representation.
Finally, we also note that we are able to obtain this performance
using a single model without an ensemble.

5.4. Qualitative Analysis: Learning to attend 

By visualizing the attention component learned by the
model, we are able to add an extra layer of interpretability
to the output of the model (see Fig. 1). Other systems
that have done this rely on object detection systems to produce
candidate alignment targets (Karpathy & Li, 2014).
Our approach is much more flexible, since the model can
attend to “non object” salient regions.

The 19-layer OxfordNet uses stacks of 3x3 filters, meaning
the only time the feature maps decrease in size is due
to the max pooling layers. The input image is resized so
that the shortest side is 256 dimensional with preserved aspect
ratio. The input to the convolutional network is the
center cropped 224x224 image. Consequently, with 4 max
pooling layers we get an output dimension of the top
convolutional layer of 14x14. Thus in order to visualize the
attention weights for the soft model, we simply upsample
the weights by a factor of 2^4 = 16 and apply a Gaussian
filter. We note that the receptive fields of each of the 14x14
units are highly overlapping.
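
The upsampling arithmetic can be illustrated with a short sketch: a 14x14 grid of attention weights, repeated by a factor of 16 in each direction, covers the 224x224 input crop. Nearest-neighbour repetition is an assumption made here for simplicity, the Gaussian smoothing step is omitted, and the function name is hypothetical.

```python
def upsample_nearest(weights, factor):
    """Repeat every cell of a 2-D grid `factor` times along both axes."""
    out = []
    for row in weights:
        wide = [w for w in row for _ in range(factor)]      # stretch horizontally
        out.extend([list(wide) for _ in range(factor)])     # stretch vertically
    return out
```

Calling `upsample_nearest(grid, 16)` on a 14x14 grid yields a 224x224 map whose blocky cells would then be smoothed before being overlaid on the image.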

As we can see in Figures 2 and 3, the model learns alignments
that correspond very strongly with human intuition.
Especially in the examples of mistakes, we see that it is
possible to exploit such visualizations to get an intuition as
to why those mistakes were made. We provide a more extensive
list of visualizations in Appendix A for the reader.

6. Conclusion 

We propose an attention based approach that gives
state-of-the-art performance on three benchmark datasets using
the BLEU and METEOR metrics. We also show how
the learned attention can be exploited to give more
interpretability into the model's generation process, and demonstrate
that the learned alignments correspond very well to
human intuition. We hope that the results of this paper will
encourage future work in using visual attention. We also
expect the modularity of the encoder-decoder approach
combined with attention to have useful applications in other
domains.
A. Appendix

Visualizations from our “hard” (a) and “soft” (b) attention model. White indicates the regions where the model roughly
attends to (see Section 5.4).

(a) A man and a woman playing frisbee in a field.
(b) A woman is throwing a frisbee in a park.
Figure 6.

(a) A giraffe standing in the field with trees.
(b) A large white bird standing in a forest.
Figure 7.

(a) A dog is laying on a bed with a book.
(b) A dog is standing on a hardwood floor.
Figure 8.

(a) A woman is holding a donut in his hand.
(b) A woman holding a clock in her hand.
Figure 9.

(a) A stop sign with a stop sign on it.
(b) A stop sign is on a road with a mountain in the background.
Figure 10.

(a) A man in a suit and a hat holding a remote control.
(b) A man wearing a hat and a hat on a skateboard.
Figure 11.

(a) A little girl sitting on a couch with a teddy bear.
(b) A little girl sitting on a bed with a teddy bear.

(a) A man is standing on a beach with a surfboard.
(b) A person is standing on a beach with a surfboard.

(a) A man and a woman riding a boat in the water.
(b) A group of people sitting on a boat in the water.
Figure 12.

(a) A man is standing in a market with a large amount of food.
(b) A woman is sitting at a table with a large pizza.
Figure 13.

(a) A giraffe standing in a field with trees.
(b) A giraffe standing in a forest with trees in the background.
Figure 14.

(a) A group of people standing next to each other.